CRI RNAseq Pipelines

RUN AS PRACTICE

Pipeline package is available on GitHub

The Center for Research Informatics (CRI) provides computational resources and expertise in biomedical informatics for researchers in the Biological Sciences Division (BSD) of the University of Chicago.

As a bioinformatics core, we are actively improving our pipelines and expanding their functions. The tutorials will be updated in a timely manner but may not reflect the newest updates of the pipelines. Stay tuned for the latest pipeline release.

If you have any questions, comments, or suggestions, feel free to contact our core at bioinformatics@bsd.uchicago.edu or one of our bioinformaticians.

Quick Start | Top

  1. modify the generator script Build_RNAseq.DLBC.sh accordingly
    1. project="PROJECT_AS_PREFIX" (e.g., DLBC, which is used as the prefix of the metadata file DLBC.metadata.txt and the configuration file DLBC.pipeline.yaml)
    2. padding="DIRECTORY_NAME_CONTAINING_PROJECT_DATA" (e.g., example, the folder that holds the metadata file, configuration file, sequencing data folder, and references folder)
  2. prepare metadata file as the example DLBC.metadata.txt
    1. Single End (SE) Library
      1. Set Flavor column as 1xReadLength (e.g., 1x50)
      2. Set Seqfile1 column as the file name of the respective sequencing file
    2. Paired End (PE) Library
      1. Set Flavor column as 2xReadLength (e.g., 2x50)
      2. Set Seqfile1 column as the file name of the respective read 1 (R1) sequencing file
      3. Set an additional column named ‘Seqfile2’ as the file name of the respective read 2 (R2) sequencing file
    3. Non strand-specific Library
      1. Set LibType column to NS
    4. Strand-specific Library
      1. Ask your sequencing center for the library type and set LibType column to FR (the left-most end of the fragment, in transcript coordinates, is sequenced first; first-strand synthesis) or RF (the right-most end of the fragment, in transcript coordinates, is sequenced first; second-strand synthesis). You can read this blog for more details of strand-specific RNA-seq.
  3. prepare reference files as the example reference hg38 under /CRI/HPC/cri_rnaseq_2018_ex/example/references/v28_92_GRCh38.p12
  4. prepare pre-built STAR index files as the example reference hg38 under /CRI/HPC/cri_rnaseq_2018_ex/example/references/v28_92_GRCh38.p12/STAR

Introduction | Top

RNA sequencing (RNA-seq) is an approach that uses next-generation sequencing technologies to detect and quantify expressed transcripts in biological samples. Compared to other methods such as microarrays, RNA-seq provides a less biased assessment of the full range of transcripts and their isoforms, with a greater dynamic range in expression quantification.

In this tutorial, you will learn how to use the CRI’s RNA-seq pipeline (available on both the CRI HPC cluster and GitHub) to analyze Illumina RNA sequencing data. The tutorial comprises the following steps:

By the end of this tutorial, you will:

This tutorial is based on CRI’s high-performance computing (HPC) cluster. If you are not familiar with this newly assembled cluster, a concise user’s guide can be found here.

Work Flow | Top

The RNA-seq data used in this tutorial are from DLBC.

In this tutorial, we use the sequencing reads in the project DLBC in mouse as an example. The sample information is saved in the file DLBC.metadata.txt (see below).

Work Flow


Data Description | Top

Six (partial) single-end RNA-seq libraries are used as the example dataset in this tutorial. The respective sample information is described in the metadata table example/DLBC.metadata.txt.

Sample Description
Sample Library ReadGroup LibType Platform SequencingCenter Date Lane Unit Flavor Encoding Run Genome NucleicAcid Group Location Seqfile1
KO01 KO01 SRR1205282 NS Illumina SRA 2015-07-22 7 FCC2B5CACXX 1x49 33 0 hg38 rnaseq KO example/data SRR1205282.fastq.gz
KO02 KO02 SRR1205283 NS Illumina SRA 2015-07-22 7 FCC2B5CACXX 1x49 33 0 hg38 rnaseq KO example/data SRR1205283.fastq.gz
KO03 KO03 SRR1205284 NS Illumina SRA 2015-07-22 7 FCC2B5CACXX 1x49 33 0 hg38 rnaseq KO example/data SRR1205284.fastq.gz
WT01 WT01 SRR1205285 NS Illumina SRA 2015-07-22 4 FCC2C3CACXX 1x49 33 0 hg38 rnaseq WT example/data SRR1205285.fastq.gz
WT02 WT02 SRR1205286 NS Illumina SRA 2015-07-22 4 FCC2C3CACXX 1x49 33 0 hg38 rnaseq WT example/data SRR1205286.fastq.gz
WT03 WT03 SRR1205287 NS Illumina SRA 2015-07-22 4 FCC2C3CACXX 1x49 33 0 hg38 rnaseq WT example/data SRR1205287.fastq.gz
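To make the layout of the metadata table concrete, here is a minimal sketch of how it could be loaded and grouped by condition. It assumes the real DLBC.metadata.txt is tab-delimited with the header row shown above; the function names read_metadata and samples_by_group are ours for illustration and are not part of the pipeline.

```python
import csv
import io
from collections import defaultdict


def read_metadata(handle):
    """Read a tab-delimited metadata table into a list of dicts (one per library)."""
    return list(csv.DictReader(handle, delimiter="\t"))


def samples_by_group(rows):
    """Map each value of the Group column to the sample names it contains."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["Group"]].append(row["Sample"])
    return dict(groups)


# Header and two data rows copied from the table above, re-joined with tabs
# (assumption: the real file uses tabs as the column separator).
EXAMPLE = "\n".join(
    "\t".join(line.split())
    for line in [
        "Sample Library ReadGroup LibType Platform SequencingCenter Date Lane "
        "Unit Flavor Encoding Run Genome NucleicAcid Group Location Seqfile1",
        "KO01 KO01 SRR1205282 NS Illumina SRA 2015-07-22 7 FCC2B5CACXX 1x49 "
        "33 0 hg38 rnaseq KO example/data SRR1205282.fastq.gz",
        "WT01 WT01 SRR1205285 NS Illumina SRA 2015-07-22 4 FCC2C3CACXX 1x49 "
        "33 0 hg38 rnaseq WT example/data SRR1205285.fastq.gz",
    ]
)

rows = read_metadata(io.StringIO(EXAMPLE))
print(samples_by_group(rows))
```

With the full file, the same call would list the three KO and three WT samples under their respective groups.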

Prerequisites | Top

We will use SSH (Secure Shell) to connect to CRI’s HPC. SSH is now included with, or can be installed on, all standard operating systems (Windows, Linux, and OS X).

Login and Setup Tutorial Working Directory | Top

The login procedure varies slightly depending on whether you use a Mac/Unix/Linux computer or a Windows computer.

  • Log in to one of the entry nodes of the CRI HPC cluster
    1. Open a terminal session.
    2. Connect to the login node of the CRI HPC cluster:

      $ ssh -l username hpc.cri.uchicago.edu
    3. If it’s your first time logging in, you will be asked to accept the SSH key. Type “yes”.
    4. Type in the password when prompted

      Make sure that you replace username with your login name.

  • CAUTION
    • THIS PACKAGE IS LARGE. PLEASE DO NOT DOWNLOAD IT TO YOUR HOME DIRECTORY
    • USE ANOTHER LOCATION, SUCH AS /CRI/HPC/username
  • Set up a tutorial directory
    1. You should be in your home directory after logging in

      $ pwd
      /home/username
    2. Instead of downloading the pipeline package to your home directory, use another location such as /CRI/HPC/username

      $ cd /CRI/HPC/username; pwd
      /CRI/HPC/username
  • Download the pipeline package
    1. One way is to download the pipeline package via git clone

      $ git clone git@github.com:wenching/cri_rnaseq_2018.git
    2. Or download the latest package via wget

      $ wget https://github.com/wenching/cri_rnaseq_2018/archive/master.tar.gz
      1. Uncompress the tarball file

        $ tar -zxvf master.tar.gz
      2. Change the folder name

        $ mv cri_rnaseq_2018-master cri_rnaseq_2018
    3. Change the working directory to the pipeline directory

      $ cd cri_rnaseq_2018
      $ tree -d -L 4
      ## ../
      ## ├── SRC
      ## │   ├── Python
      ## │   │   ├── lib
      ## │   │   ├── module
      ## │   │   └── util
      ## │   └── R
      ## │       ├── module
      ## │       └── util
      ## ├── docs
      ## │   ├── IMG
      ## │   └── result
      ## └── example
      ##     ├── data
      ##     └── references
      ##         └── v28_92_GRCh38.p12
      ##             └── STAR
      ## 
      ## 16 directories
  • File structure
    • Raw sequencing data files (*.fastq.gz) are located at example/data/

      $ tree example/data/
      |-- SRR1205282.fastq.gz
      |-- SRR1205283.fastq.gz
      |-- SRR1205284.fastq.gz
      |-- SRR1205285.fastq.gz
      |-- SRR1205286.fastq.gz
      `-- SRR1205287.fastq.gz
    • Genome data are located at example/references/v28_92_GRCh38.p12

      $ tree example/references/v28_92_GRCh38.p12
      |-- GRCh38_rRNA.bed
      |-- GRCh38_rRNA.bed.interval_list
      |-- STAR
      |   |-- Genome
      |   |-- Log.out
      |   |-- SA
      |   |-- SAindex
      |   |-- chrLength.txt
      |   |-- chrName.txt
      |   |-- chrNameLength.txt
      |   |-- chrStart.txt
      |   |-- exonGeTrInfo.tab
      |   |-- exonInfo.tab
      |   |-- geneInfo.tab
      |   |-- genomeParameters.txt
      |   |-- run_genome_generate.logs
      |   |-- sjdbInfo.txt
      |   |-- sjdbList.fromGTF.out.tab
      |   |-- sjdbList.out.tab
      |   `-- transcriptInfo.tab
      |-- genes.gtf
      |-- genes.gtf.bed12
      |-- genes.refFlat.txt
      |-- genome.chrom.sizes
      |-- genome.dict
      `-- genome.fa
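Before launching a run on your own reference, it can help to confirm that the reference folder is complete. The sketch below checks a folder against a short list of files taken from the listing above. Which files the pipeline strictly requires is an assumption on our part, and missing_reference_files is a hypothetical helper, not a pipeline function.

```python
from pathlib import Path

# Assumed minimum set, taken from the directory listing above;
# the pipeline itself may need more or fewer files.
EXPECTED = [
    "genome.fa",
    "genes.gtf",
    "STAR/Genome",
    "STAR/SA",
    "STAR/SAindex",
]


def missing_reference_files(ref_dir, expected=EXPECTED):
    """Return the expected reference files that are absent from ref_dir."""
    root = Path(ref_dir)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    # An empty result means every file in EXPECTED was found.
    print(missing_reference_files("example/references/v28_92_GRCh38.p12"))
```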
  • Pipeline/project related files
    • Project-related files (i.e., the metadata and configuration files) used in this tutorial are located under example/

      $ ls -l example/DLBC.*
      ## example/DLBC.metadata.txt
      ## example/DLBC.pipeline.yaml
      • Here are the first few lines in the configuration example file example/DLBC.pipeline.yaml

        ---
        pipeline:
          flags:
            aligners:
              run_star: True
            quantifiers:
              run_featurecounts: True
              run_rsem: False
              run_kallisto: False
            callers:
              run_edger: True
              run_deseq2: True
              run_limma: True
          software:
            main:
              use_module: 0
              adapter_pe: AGATCGGAAGAGCGGTTCAG,AGATCGGAAGAGCGTCGTGT
              adapter_se: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA
              fastq_format: 33
              genome_assembly: hg38

      When running on another dataset, you will need to modify these two files and the master pipeline script (i.e., Build_RNAseq.DLBC.sh) accordingly, as described below.

      For instance, if you would like to turn off the DE analysis tool limma, set the respective parameter to ‘False’ in the configuration file: run_limma: False
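If you prefer to switch a flag from a script rather than by hand, the following is a minimal sketch that edits the configuration text directly with the Python standard library. It assumes each flag sits on its own line as `key: True` or `key: False`, as in the excerpt above. set_flag is a hypothetical helper, and a YAML library would be a reasonable alternative.

```python
import re


def set_flag(config_text, key, value):
    """Return config_text with the first `key: True/False` line set to value."""
    pattern = re.compile(
        rf"^([ \t]*{re.escape(key)}:[ \t]*)(True|False)[ \t]*$", re.MULTILINE
    )
    # Keep the original indentation and key; swap only the boolean.
    new_text, count = pattern.subn(
        lambda m: m.group(1) + str(bool(value)), config_text, count=1
    )
    if count == 0:
        raise KeyError(f"no boolean flag named {key!r} found")
    return new_text


# Lines copied from the configuration excerpt above.
EXCERPT = """\
callers:
  run_edger: True
  run_deseq2: True
  run_limma: True
"""

print(set_flag(EXCERPT, "run_limma", False))
```

Only the matching line changes; indentation and all other flags are left as they were.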

      For the metadata file, pay attention to the following settings
      1. Single End (SE) Library
        1. Set Flavor column as 1xReadLength (e.g., 1x50)
        2. Set Seqfile1 column as the file name of the respective sequencing file
      2. Paired End (PE) Library
        1. Set Flavor column as 2xReadLength (e.g., 2x50)
        2. Set Seqfile1 column as the file name of the respective read 1 (R1) sequencing file
        3. Set an additional column named ‘Seqfile2’ as the file name of the respective read 2 (R2) sequencing file
      3. Non strand-specific Library
        1. Set LibType column to NS
      4. Strand-specific Library
        1. Ask your sequencing center for the library type and set LibType column to FR (the left-most end of the fragment, in transcript coordinates, is sequenced first; first-strand synthesis) or RF (the right-most end of the fragment, in transcript coordinates, is sequenced first; second-strand synthesis). You can read this blog for more details of strand-specific RNA-seq.
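The metadata settings listed above can be expressed as a small consistency check. This is a sketch under our own assumptions, not the validation the pipeline performs: check_row is a hypothetical helper that takes one metadata row as a dict of column name to value and reports violations of the Flavor, Seqfile1/Seqfile2, and LibType rules.

```python
import re

# Flavor is written as 1xReadLength (single end) or 2xReadLength (paired end).
FLAVOR_RE = re.compile(r"^([12])x(\d+)$")
LIB_TYPES = {"NS", "FR", "RF"}


def check_row(row):
    """Return a list of problems found in one metadata row (empty if none)."""
    problems = []
    if not row.get("Seqfile1"):
        problems.append("Seqfile1 is required")
    match = FLAVOR_RE.match(row.get("Flavor") or "")
    if match is None:
        problems.append("Flavor must look like 1xReadLength or 2xReadLength")
    elif match.group(1) == "2" and not row.get("Seqfile2"):
        problems.append("paired-end library needs a Seqfile2 column")
    if row.get("LibType") not in LIB_TYPES:
        problems.append("LibType must be NS, FR, or RF")
    return problems


# A single-end, non strand-specific row like those in DLBC.metadata.txt.
print(check_row({"Flavor": "1x49", "LibType": "NS", "Seqfile1": "SRR1205282.fastq.gz"}))
```

Running the check on every row before building the pipeline scripts would catch a missing Seqfile2 column or a mistyped LibType early.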
    • Master pipeline script

      $ cat Build_RNAseq.DLBC.sh
      ## 
      ## 
      ## ## build pipeline scripts
      ## 
      ## now=$(date +"%m-%d-%Y_%H:%M:%S")
      ## 
      ## ## project info
      ## project="DLBC"
      ## SubmitRNAseqExe="Submit_${PWD##*/}.sh"
      ## padding="example/"
      ## 
      ## ## command
      ## echo "START" `date` " Running build_rnaseq.py"
      ## python3 SRC/Python/build_rnaseq.py \
      ##  --projdir $PWD \
      ##  --metadata $PWD/${padding}$project.metadata.txt \
      ##  --config $PWD/${padding}$project.pipeline.yaml \
      ##  --systype cluster \
      ##  --threads 8 \
      ##  --log_file $PWD/Build_RNAseq.$project.$now.log
      ## 
      ## ## submit pipeline master script
      ## echo "START" `date` " Running $SubmitRNAseqExe"
      ## echo "bash $SubmitRNAseqExe"
      ## 
      ## echo "END" `date`

      Basically, when running on your own dataset, you will need to modify this master pipeline script (i.e., Build_RNAseq.DLBC.sh) accordingly.

      For instance, you can change the respective parameters as follows.
      • project="PROJECT_AS_PREFIX" (e.g., DLBC, which is used as the prefix of the metadata file DLBC.metadata.txt and the configuration file DLBC.pipeline.yaml)
      • padding="DIRECTORY_NAME_CONTAINING_PROJECT_DATA" (e.g., example, the folder that holds the metadata file, configuration file, sequencing data folder, and references folder)
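To see how project and padding combine into the paths passed to build_rnaseq.py, here is a minimal sketch that mirrors the --metadata and --config arguments of the script above. project_files is a hypothetical helper for illustration, and the project directory in the example is a placeholder.

```python
from pathlib import PurePosixPath


def project_files(projdir, project, padding):
    """Build the metadata and configuration paths the way the master script does."""
    base = PurePosixPath(projdir) / padding
    return {
        "metadata": str(base / f"{project}.metadata.txt"),
        "config": str(base / f"{project}.pipeline.yaml"),
    }


# Values from Build_RNAseq.DLBC.sh; the project directory is a placeholder.
paths = project_files("/CRI/HPC/username/cri_rnaseq_2018", "DLBC", "example/")
print(paths["metadata"])
print(paths["config"])
```

Changing project renames both files at once, while changing padding moves them to a different folder under the project directory.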

Pipeline Steps | Top

BigDataScript Report | Top

Given the environment of the CRI HPC system, the current pipeline uses BigDataScript as its job management system to automate execution. BigDataScript handles the execution dependencies among all sub-task bash scripts and can resume from the point of failure, if any.

After the entire pipeline completes, you will find a BigDataScript report in HTML under the pipeline folder. For instance, this is the report from one test run. The graphic timeline shows the execution time of each sub-task script.

BDS_Report

[Last Updated on 2018/11/14] | Top